(57) Fault tolerance in a redundant array 13 of disk
drives 20, 21, 22, 23, 24 is degraded when errors
exist in the array. Data in the array is rebuilt to 
remove the degradation, either for entire disk 
drives or for partial data, with minimal impact 
on array resources. Rebuilding occurs during 
idle time of the array or is interleaved between 
current data area accessing operations of the 
array at a rate which is inversely proportional to 
the activity level of the array. Alternatively, data 
is rebuilt when a data area being accessed is in 
need of rebuilding.

The present invention relates to parity arrays of
disk drives, and particularly to recovery from degraded
redundancy in such an array by rebuilding data
from error-affected tracks.

Patterson et al in the article "A CASE FOR RE- 
DUNDANT ARRAYS OF INEXPENSIVE DISKS 
(RAID)", ACM 1988, March 1988, describe several ar- 
rangements for using a plurality of data-storing disk 
drives. Various modes of operation are described; in 
one mode the data storage is divided among the sev- 
eral drives to effect a storage redundancy. Data to be 
stored is partially stored in a predetermined number 
of the disk drives in the array, at least one of the disk 
drives storing error detecting redundancies. For ex- 
ample, four of the disk drives may store data while a 
fifth disk drive may store parity based upon data stor- 
ed in the four disk drives. Such a redundant array of 
disk drives may provide high data availability by intro- 
ducing error correcting redundancy data in one of the 
disk drives. For example, four data blocks (one data 
block in each of the four drives) are used to compute 
an error correcting redundancy, such as a parity value;
the computed error correcting redundancy is stored
as a fifth block on the fifth drive. All blocks have
the same number of data bytes and may be (not a re- 
quirement) stored in the five disk drives at the same 
relative track locations. The five drives form a parity 
group of drives. If any one of the drives in the parity 
group fails, in whole or in part, the data from the failing
drive can be reconstructed using known error correcting
techniques. It is desired to efficiently rebuild
and replace the data from the failing disk drive while 
continuing accessing the drives in the array for data 
processing operations. 

The disk drives in a parity group of drives may act 
in unison as a single logical disk drive. Such a logical 
drive has logical cylinders and tracks consisting of 
like-located cylinders and tracks in the parity group 
drives. In such array usage, the data being stored is 
partially stored in each of the data-storing drives in an 
interleaved manner in a so-called striped mode. Alter- 
nately, the disk drives and their data in the parity group 
may be independently addressable and used in a so- 
called independent mode. 

Whenever one of the disk drives in a single-parity 
array fails, even though data can be successfully re- 
covered, the fault tolerance to error conditions is lost. 
To return to a desired fault tolerant state, the failing 
disk drive must be replaced or repaired and the affected
data content rebuilt to the desired redundancy. It
is desired to provide control means and methods for 
effecting such rebuilding of data and its redundancy 
to remove the error from a partially or wholly failed 
disk drive in a parity array of disk drives. 

Accordingly the invention provides a method of
automatically maintaining fault tolerance in a parity
array consisting of a plurality of disk drives, comprising
the machine-executed steps of:

detecting a degradation in the fault tolerance of
the parity array;

evaluating the current information handling activity
of the parity array;

establishing a plurality of data rebuild methods
for removing the fault tolerance degradation from the
parity array; and

analyzing the current information handling activity
of the parity array and selecting one of the plurality
of rebuild methods to ensure that the performance
of said current information handling activity is
not degraded below a predetermined level.

This approach allows the data in a parity group of
disk drives to be completely rebuilt to a fault tolerant
state after the loss or degradation of that state, caused
by a partially or wholly failed disk drive in a parity
array of disk drives, has been detected. The rebuild is
relatively non-intrusive: accesses to the parity array
continue for data storage and retrieval.

Preferred examples of said plurality of data rebuild
methods are:

(1) measuring the rate of machine operations of
the array;
    establishing the rate of rebuilding data affected
by the error in inverse proportion to the
measured rate of machine operations; and
    rebuilding the data accordingly.

(2) detecting that the parity array is idle;
    detecting that one of the disk drives is affected
by error;
    rebuilding the data during said detected
idle times.

(3) performing a data area access operation to an
addressable data unit in the parity array;
    while performing the data area access operation,
detecting that the addressed data unit is
affected by error and in need of a data rebuild;
    rebuilding the addressable data unit being
accessed.

It is further preferred that the method includes:

completing a data rebuild using the rebuild
method of (1) above;

detecting that the parity array is idle; and

continuing the data rebuilding of additional addressable
error-affected data so long as the parity array
is idle.

The invention also provides apparatus for automatically
maintaining fault tolerance in a parity array
consisting of a plurality of disk drives, the apparatus
comprising:

means for detecting a degradation in fault tolerance
in the parity array such that one of the disk
drives needs to have data rebuilt;

means for measuring a rate of machine operations
of said array;

means responsive to said measured operations
rate for establishing a rate of rebuilding for the
array for recovering from said degradation; and

a plurality of data rebuild means for performing
data rebuild in said one disk drive.

It is preferred that the apparatus further compris- 
es control means for controlling access to the disk 
drives in the parity array and for detecting when the 
array is currently not being accessed for a data han- 
dling operation. Preferred examples of data rebuild 
means are: 

(1) data rebuild means connected to said control 
means and to said rate establishing means for de- 
termining whether a rebuild can be scheduled, 
and if so, activating the control means to give ac- 
cess to the parity array to perform a series of re- 
build operations at the established rate. 

(2) data rebuild means connected to said control 
means and responsive to the parity array not be- 
ing currently accessed to activate the control 
means to give access to the parity array to per- 
form a data rebuild. 

(3) data rebuild means connected to the control 
means and responsive to a disk drive access oc- 
curring in an area of the array that needs a data 
rebuild to perform a data rebuild. 

In example (1) above it is further preferred that 
the apparatus also includes means for indicating a 
plurality of rebuild rates, each corresponding to pre- 
determined ranges of machine operations rates and 
in inverse proportion to the operations rate; and 

means for indicating one of said plurality of re- 
build rates as said established rate corresponding to 
the current machine operations rate. 

Preferably each of the data rebuild means upon 
completing a data rebuild operation is responsive to 
the control means indicating no current access or no 
pending access requests to initiate another data re- 
build operation. 

Thus failures in a redundant array of disk drives 
are remedied by rebuilding the error-affected data us- 
ing any one of a plurality of techniques to allow a con- 
tinuing use of the disk drive array for information han- 
dling and data processing. One technique is a vari- 
able rate rebuild which schedules rebuilds at a rate in 
a detected inverse ratio to a current or pending rate 
of disk drive usage or accessing within a parity group. 
Upon completing each scheduled rebuild, this method 
and apparatus also preferably takes advantage of any 
idle time of the array by continuing rebuild if there is 
no waiting access. Another technique performs rebuild
during predetermined array idle times by starting
a non-scheduled rebuild of a predetermined portion
of the error-affected data. A third or opportunistic
technique detects a need for a data rebuild during a 
usual access to the array. The rebuilding may use any 
or all of the above techniques in conjunction with each 
other if desirable. 

Thus data can be rebuilt onto a scratch or new
disk drive which replaces a disk drive in error to restore
redundancy of the array. The above techniques
also apply to a partially failed disk drive in which the
error-affected data are rebuilt in a different track or
zone of the disk drive in error; in the latter rebuild, data
in the non-failing disk drives may also be moved to
corresponding zones or tracks in the respective
drives.

Thus one way of automatically maintaining fault
tolerance of a redundant array of disk drives is by detecting
that the fault tolerance of the parity array is degraded
by a plurality of error-affected addressable
data units of the parity array which respectively need
data rebuilding to reestablish the fault tolerance, accessing
the error-affected disk drive, and during said
access rebuilding data affected by said error. In other
words, this involves performing a data area access
operation in the parity array, and while performing the
data area access operation, detecting that the data
access operation is accessing one of the addressable
data units needing a data rebuild and rebuilding at
least the data unit being accessed.

Alternative approaches are: detecting that the array
of disk drives has no current access and rebuilding
data affected by said error-affected disk drive beginning
upon the detection of no current access; or
determining the rate of information handling activity
and selecting a data rebuild rate in a predetermined
inverse ratio to the determined rate of information
handling activity and using a variable rate rebuild
which performs data rebuilding at said selected rebuild
rate of a predetermined number of addressable
error-affected data units in the parity array.

To recap therefore, a redundant disk array provides
high data availability by introducing data redundancy
such as a parity check-sum across multiple
drives. For example, four blocks of data of equal size
from four different disk drives are used to compute a
parity block of the same size, which is stored on a fifth
drive. The five drives would form a parity group. With
such an approach, when one drive in the array fails,
its data can be reconstructed and made available
from the other drives in its parity group. The drives in
a parity group may act in unison as a single logical
drive as a result of data being interleaved (striped)
across the drives. This is called the striped mode. In
addition, the drives in a parity group may be used
independently. This is called the independent mode.

When one drive in a single-parity array fails, even
though its data can be reconstructed through the
other drives and is, therefore, logically available, the
array is no longer in a fault tolerant state. A second
drive failure within the parity group would render the
data on the two failed drives lost or unavailable. To return
to the fault tolerant state, the failed drive must be
repaired and its content rebuilt.

Ideally, the data rebuild procedure should satisfy
two objectives:

1. It should be completed as quickly as possible
so as to minimise the window during which the array
is vulnerable to data loss/unavailability due to
a second failure.

2. The rebuild procedure itself should be as non-intrusive
as possible so as to minimise the impact
on the performance of regular I/O services of the
other disk drives in the array.

Because of objective 2, it is not acceptable to shut
down the array and rebuild the data on the entire drive
at one time, even though that would clearly satisfy objective 1.
Rather, the content of the failed drive must
be rebuilt piecewise. The present invention satisfies
both objectives and it is applicable to arrays that operate
in either the striped mode or the independent
mode, or both.

When a unit of data is being rebuilt, a rebuild I/O
must

1. read the corresponding data units from the
other drives in the same parity group,

2. reconstruct the missing data unit by taking the
exclusive-OR of the corresponding data units,
then

3. write the reconstructed data unit to the drive
being rebuilt.
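
The three steps of a rebuild I/O can be sketched in Python as follows. This is a minimal illustration only, assuming simple byte-wise parity; the helper names `xor_blocks` and `rebuild_unit` are hypothetical, and in-memory byte strings stand in for data units read from and written to the drives.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise exclusive-OR of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same size")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_unit(surviving_units):
    """Reconstruct the missing unit from the surviving data and parity units."""
    return xor_blocks(surviving_units)


# Four data units and their parity unit (illustrative values only).
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00", b"\x0f\xf0"]
parity = xor_blocks(data)

# Assume the drive holding data[2] has failed: read the others, then rebuild.
survivors = [data[0], data[1], data[3], parity]
rebuilt = rebuild_unit(survivors)
```

In this sketch `rebuilt` equals the unit that was lost, because XOR-ing the parity with the three surviving data units cancels their contribution.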

A rebuild algorithm must answer the following 
three questions 

(a) how much data should be rebuilt at a time (i.e. 
for each rebuild I/O)? 

(b) when should a rebuild I/O be issued? 

(c) which part of the disk should the next rebuild 
I/O rebuild? 

Rebuilding one block at a time appears to be inefficient,
as the size of the data rebuild unit (one block)
is small relative to the time required to rebuild the unit.
I/O performance, however, does not suffer, as the total
duration of the rebuild I/O is short. In contrast, rebuilding
one cylinder at a time may be very efficient,
as the amount of rebuild work completed per rebuild
I/O is large relative to the time required to rebuild the
unit. Nevertheless, the cylinder rebuild ties up all the
drives for a long period of time, severely degrading the
performance of regular I/O's. Rebuilding one track at
a time is a good compromise between the efficiency
of the rebuild operation and its performance impact,
although the invention will work with any reasonable
size of data rebuild unit.

Three ways are provided to generate a rebuild
I/O.

Variable rate: First, rebuild I/O's can be generated
at a variable rate depending on the current level of I/O
activity. These are called scheduled rebuild I/O's. The
array system monitors the rate of all regular I/O activities
and adjusts the rebuild rate accordingly to maintain
reasonable subsystem performance. A scheduled
rebuild I/O does not have to wait for the array
subsystem to be quiescent but is injected into the
subsystem periodically according to the rebuild rate
prevalent at the time. This way, it can be ensured that
rebuild activities do get scheduled even though the
subsystem may never be idle. A method for determining
the rebuild I/O rate will be described later.

Furthermore, in order to make efficient use of the
array subsystem's idle time, after a rebuild I/O has
completed the rebuild of one track (or rebuild data
unit) and there is no regular I/O waiting for any of the
drives in the parity group, the rebuild I/O can be allowed
to continue on to the next track until the next
regular I/O arrives. When the next regular I/O arrives,
the rebuild I/O can be immediately terminated (current
track not rebuilt). Therefore, a scheduled rebuild
I/O rebuilds a minimum of one track (or rebuild data
unit) at a time, but may rebuild more if the subsystem
has idle time.

When Idle: Second, when the array subsystem
detects that there is no I/O activity on the affected parity
group, it can generate a rebuild I/O. The rebuild I/O
can be of one of two types, scheduled or non-scheduled.
If the subsystem has not generated the required
number of rebuild I/O's for the current time period,
then it can generate the next scheduled rebuild
I/O. This rebuild I/O works in exactly the same manner
as described above, i.e. it is not interruptible until at
least one track (or rebuild data unit) has been rebuilt.
If the subsystem has already generated the required
number of rebuild I/O's for the current time period,
then it can generate a non-scheduled rebuild I/O. The
non-scheduled rebuild I/O operates in a slightly different
manner from the scheduled rebuild I/O. Since the
required number of scheduled rebuild I/O's has been
completed for the current time period, any extra non-scheduled
rebuild I/O activity must not affect the performance
of regular I/O activity. Therefore, the non-scheduled
rebuild I/O is immediately terminated by
the arrival of a regular I/O even though it may not have
finished rebuilding the first track (or rebuild data unit)
of the rebuild I/O. After completion of the rebuild of
one track, the procedure can continue with the next
track (or rebuild data unit) until terminated by the arrival
of a regular I/O.
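
The difference in interruptibility between scheduled and non-scheduled rebuild I/O's can be sketched as below. This is an illustrative model only: the function `run_rebuild_io`, the notion of a fixed number of work steps per track, and the polled `regular_io_waiting` callable are all assumptions introduced for the example, not part of the described apparatus.

```python
def run_rebuild_io(tracks, steps_per_track, regular_io_waiting, scheduled):
    """Return the tracks fully rebuilt by one rebuild I/O.

    tracks             -- ordered candidate tracks for this rebuild I/O
    steps_per_track    -- hypothetical number of work steps needed per track
    regular_io_waiting -- zero-argument callable, True once a regular I/O waits
    scheduled          -- True for a scheduled rebuild I/O, False otherwise
    """
    rebuilt = []
    for index, track in enumerate(tracks):
        # Only the first track of a scheduled rebuild I/O is guaranteed.
        guaranteed = scheduled and index == 0
        completed = True
        for _ in range(steps_per_track):
            if not guaranteed and regular_io_waiting():
                completed = False  # terminated; current track not rebuilt
                break
            # One step of read / exclusive-OR / write work would happen here.
        if not completed:
            break
        rebuilt.append(track)
    return rebuilt


def arrives_after(polls):
    """Build a test callable that reports a waiting regular I/O after `polls` polls."""
    state = {"count": 0}

    def waiting():
        state["count"] += 1
        return state["count"] > polls

    return waiting
```

With a regular I/O already waiting, `run_rebuild_io([0, 1, 2], 4, arrives_after(0), scheduled=True)` still completes the first track, whereas the same call with `scheduled=False` rebuilds nothing.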

Opportunistic: In addition to scheduled and non-scheduled
rebuilds, the array subsystem may generate
rebuild I/O's when the opportunity presents itself.
In this case, when a regular I/O requests data from the
failed drive, the array subsystem rebuilds the entire
track (or rebuild data unit) containing the requested
data (if the requested data is larger than a track, then
all of the tracks that contain the requested data are rebuilt).
Since all the access mechanisms of the related
drives are already in place to reconstruct the requested
data, there is no additional overhead to reconstruct
the entire track. Combining a regular I/O and a
rebuild I/O into a single I/O has the effect of reducing
the total I/O rate. Thus, the impact of rebuilding on the
overall performance of the subsystem is minimised.

Again, in order to make efficient use of the array
subsystem's idle time, at the end of a combined regular
and rebuild I/O, if there is no regular I/O waiting
for any of the drives in the parity group, the rebuild I/O
can be allowed to continue on to the next track until
terminated by the arrival of the next regular I/O.
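
As a small illustration of the opportunistic case, the sketch below works out which whole tracks a regular I/O touches, so that each of them can be rebuilt in the same operation. It assumes, purely for the example, that the request is expressed as a linear byte offset and length on the failed drive and that all tracks hold `track_size` bytes; the function name `tracks_touched` is hypothetical.

```python
def tracks_touched(offset, length, track_size):
    """Return the indices of every track containing part of the request."""
    if length <= 0 or track_size <= 0 or offset < 0:
        raise ValueError("offset must be >= 0; length and track_size must be > 0")
    first = offset // track_size
    last = (offset + length - 1) // track_size
    return list(range(first, last + 1))
```

A request that fits inside one track yields a single index, while a request spanning a track boundary yields every track it overlaps, matching the rule that all tracks containing the requested data are rebuilt.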

For an opportunistic rebuild I/O, where to rebuild
is clear, viz., rebuild the track that the regular I/O is accessing.
For a scheduled or non-scheduled rebuild
I/O, it is preferred to select any track (or rebuild data
unit) from the cylinder for which the maximum of the
seeks of all the drives in the associated parity group
is minimised. For a striped mode array, since the access
mechanisms of all the drives in a parity group are
always over one single logical cylinder, this would be
the current cylinder. For an independent mode array,
let C1, C2, ..., Ck be the current cylinders of the k
drives in the parity group. Without loss of generality,
if C1 < C2 < ... < Ck, then the rebuild cylinder C
should be (C1 + Ck)/2. If all the tracks in that cylinder
have already been rebuilt, go to the adjacent cylinders
(C-1, C+1, C-2, ... etc.).
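
The cylinder selection rule can be sketched as follows. The function name `choose_rebuild_cylinder`, the zero-based cylinder numbering, the rounding down of (C1 + Ck)/2 and the `needs_rebuild` callback are assumptions made for this example; the text does not specify them.

```python
def choose_rebuild_cylinder(current_cylinders, needs_rebuild, cylinder_count):
    """Pick the cylinder to rebuild next, or None if nothing is left.

    current_cylinders -- current cylinder of each drive in the parity group
    needs_rebuild     -- callable: True if the cylinder has unrebuilt tracks
    cylinder_count    -- number of cylinders per drive (numbered from 0)
    """
    # Midpoint of the outermost arms minimises the longest seek.
    target = (min(current_cylinders) + max(current_cylinders)) // 2
    # Search outward from the target: C, C-1, C+1, C-2, C+2, ...
    for distance in range(cylinder_count):
        for candidate in (target - distance, target + distance):
            if 0 <= candidate < cylinder_count and needs_rebuild(candidate):
                return candidate
    return None
```

For a striped mode array all entries of `current_cylinders` are equal, so the midpoint is simply the current cylinder, consistent with the description above.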

To determine the rebuild I/O rate, analytical modelling,
trace-driven simulation, or actual measurement
on a real system can be used to determine the
overall performance (average response time vs. total
I/O rate) of the array subsystem for each constant rebuild
I/O rate.

One approach is to measure the average response
time during each period of time directly. If the
I/O request queues are maintained by the array subsystem,
then this measurement is done at the subsystem
level. If the I/O request queues are maintained on
the host, then this measurement is done at the host
level. In either case, the information is made known
to the array subsystem which can then adjust the rebuild
I/O rate accordingly.

In addition to rebuild I/O scheduling activity infor- 
mation, the array subsystem must keep track of which 
tracks have been rebuilt. This information is needed 
so that: 

1. a rebuilt track will not be rebuilt a second time, 
and 

2. if a regular I/O requests data from a track that 
has already been rebuilt, the data can be re- 
trieved from that track directly without reconstruc- 
tion from the other drives in the parity group. 

This rebuild record keeping can be done in a
straightforward manner using a bit map. This bit map
will have one word for each cylinder of the disk drive;
Word 1 corresponds to Cylinder 1, Word 2 corresponds
to Cylinder 2, etc. Within a word, each bit will
correspond to a track of that cylinder. Bit 1 corresponds
to Track 1, Bit 2 corresponds to Track 2, etc.
A "1" in a bit position indicates that track has not been
rebuilt, while a "0" indicates that track has already
been rebuilt. With such an arrangement, it can be
quickly determined if a cylinder has already been
completely rebuilt by testing the corresponding word
for all 0's.
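
A bit map of this kind can be sketched as below. The class name `RebuildMap` and its methods are hypothetical, and for convenience the sketch numbers cylinders and tracks from 0 rather than from 1 as in the text; a set bit means the track still needs rebuilding.

```python
class RebuildMap:
    """One integer word per cylinder; bit t is 1 while track t is unrebuilt."""

    def __init__(self, cylinders, tracks_per_cylinder):
        self.tracks_per_cylinder = tracks_per_cylinder
        all_pending = (1 << tracks_per_cylinder) - 1  # every track bit set
        self.words = [all_pending] * cylinders

    def needs_rebuild(self, cylinder, track):
        """True if the given track has not been rebuilt yet."""
        return bool(self.words[cylinder] & (1 << track))

    def mark_rebuilt(self, cylinder, track):
        """Clear the bit for a track once its rebuild has completed."""
        self.words[cylinder] &= ~(1 << track)

    def cylinder_done(self, cylinder):
        """A cylinder is completely rebuilt when its word is all zeros."""
        return self.words[cylinder] == 0


rebuild_map = RebuildMap(cylinders=3, tracks_per_cylinder=4)
for track in range(4):
    rebuild_map.mark_rebuilt(1, track)
```

After the loop, cylinder 1 tests as completely rebuilt with a single comparison against zero, while the other cylinders still report pending tracks.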

The invention therefore provides an efficient procedure
for rebuilding the data on a failed disk in a redundant
array, the salient features of which are:

1. The array system maintains an acceptable level
of performance for regular I/O's during the rebuild
period.

2. Rebuild activities will take place as long as an
acceptable level of performance for regular I/O's
can be maintained, thus guaranteeing that the rebuild
will be completed within a reasonable amount
of time.

3. The array's idle time is utilised whenever it becomes
available, thus reducing the rebuild's impact
on the performance of regular I/O services
and shortening the total rebuild time.

4. An intelligent way of choosing where to do the
next rebuild is incorporated so that it can be completed
faster and more efficiently.

5. Rebuild activity is combined with regular I/O service
whenever possible, allowing some of the rebuild
work to be done at little extra cost.

The invention will now be described in further detail
by way of example with reference to the following
drawings:

Fig. 1 illustrates in simplified form an information
handling system employing the present invention.

Fig. 2 is a graph illustrating the principles of variable
rate rebuilding in the Fig. 1 illustrated array of
disk drives.

Fig. 3 is a machine operations chart showing detecting
errors in the Fig. 1 illustrated array and priming
the system for rebuilding.

Fig. 4 is a machine operations chart showing activating
any one of three rebuild methods or apparatus.

Fig. 5 is a simplified machine operations chart
showing selection of a rebuild method and apparatus.

Fig. 6 is a diagrammatic showing of a disk recording
surface as may be used in practicing the present
invention.

Fig. 7 is a diagrammatic showing of a data structure
usable in practicing the present invention.

Fig. 8 is a diagrammatic showing of a bit map control
for effecting rebuild.

Fig. 9 is a machine operations chart showing rebuild
using a variable rate method and apparatus.

Fig. 10 is a machine operations chart showing rebuild
using an array idle time method and apparatus.

Fig. 11 is a machine operations chart showing rebuild
using the opportunistic rebuild method and apparatus.

Referring now more particularly to the appended
drawing, like numerals indicate like parts and structural
features in the various figures. Host processor(s)
10 (Fig. 1) are respectively connected to one or more
controller(s) 11 by host to peripheral interconnection
12. A plurality of parity arrays 13, 14 and 15 are connected
to controller 11 by a usual controller to peripheral
device connection 17. Each of the Fig. 1 illustrated
arrays 13-15 includes five disk drives 20-24, no limitation
thereto intended. Four of the disk drives 20-23
store like-sized blocks of data of one data unit. The
block sizes may vary from data unit to data unit. A data
unit can be an amalgamation of files, one file, graphic
data, and the like. A fifth disk drive 24 is a parity or error
detection redundancy storing drive P. The redundancy
is a parity data block having the same size as
the corresponding data blocks of the data unit. The redundancy
is computed based upon any algorithm, including
simple parity for example, using the data in the
data blocks stored in drives 20-23. All data blocks of
any data unit in disk drives 20-24 may be stored in the
same relative tracks in all of the disk drives; i.e. all
data blocks may be stored in track 16 of all the drives,
for example; while this storage arrangement simplifies
data management, it is not essential as regards
the present invention.

Disk drives 20-24 form a parity group, with disk 24 
being dedicated to storing parity blocks. This general 
arrangement is known as RAID 3 and RAID 4 architecture,
see Patterson et al, supra. Alternately, the
storage of the parity blocks can be rotated among all
disk drives in the parity group with no single drive be- 
ing designated as the parity drive. This latter storage 
arrangement is known as RAID 5 architecture. The 
present invention is equally applicable to any of these 
architectures, plus other architectures. 

When storing data to any one or more of the disk 
drives 20-23, a new parity value is computed and stor- 
ed in disk drive P 24. For efficiency purposes, it is de- 
sired to simultaneously record data in all four disk 
drives 20-23, and compute parity and record parity on 
disk drive P 24 as the data are being stored on the 
data drives; in a rotated parity arrangement, parity 
data are stored in the appropriate disk drive. 

Host processors 10 and controller 11 both participate
in the above-described machine operations; it is
to be understood that the combination of host proces- 
sors 10 and controller 11 comprises a computer 
means in which programming resides; such program- 
ming is represented by the machine operation charts 
of the various figures. The programming can be a sep- 
arate part of the Fig. 1 illustrated system or can be em- 
bodied in ROM, loadable software modules, and the 
like. 

Rebuilding data in a parity group on a disk drive
that replaces a failed disk drive, either a spare or a true
replacement, is achieved using a combination of approaches.
Scheduled rebuilds are controlled by a variable
rebuild rate method, an idle time rebuild occurs
during any idle time of the parity array, and an opportunistic
rebuild method is invoked upon each access
to a replacement drive for the failed drive that addresses
a not-yet-rebuilt data storing area. This description
assumes that a failed drive (e.g. a drive having a number
of non-recordable data storing tracks/clusters of sectors,
a failed mechanical part that prevents accessing,
or the like) has been replaced with a scratch drive or
disk using known disk drive and disk replacement procedures.

Before proceeding with a detailed description,
the principle of the variable rate rebuild method is described
with respect to Fig. 2. When this method is active,
rebuild disk accesses (input/output operations or
I/O's) are commanded or scheduled at a rate which
varies inversely to the current level of I/O activity. Fig.
2 illustrates how the rate of rebuild scheduling is ascertained.
Such rebuilding is interleaved with host
processor 10 or other disk drive accesses, as will become
apparent. A desired response time T is determined
for the parity group to be managed. Such a response
time is determined using known system analysis
techniques, or it can be chosen arbitrarily.
The five curves 30, also respectively labelled
1-5, show the variation of average response time
(vertical axis) with the total I/O rate represented
on the horizontal axis. The total I/O rate is determined
by the activity of the host processors 10. The
I/O rate is repeatedly monitored in predetermined
constant measurement periods. The measured I/O
rate determines the rebuild rate for the next ensuing
measurement period. The measured rate during each
measurement period is the computed average I/O rate
for the measurement period of the parity group. When
the I/O rate is higher than R1, then no rebuilds are
scheduled during the next ensuing measurement period.
During such a measurement period rebuilds may
occur using either the idle or the opportunistic rebuild
methods. The rebuild schedule rate for one measurement
period is next listed using the Fig. 2 chart as a
guide. For an I/O rate between R1 and R2, one rebuild
is scheduled; upon measuring an I/O rate between R3
and R2, two rebuilds are scheduled; a measured I/O
rate between R4 and R3 results in three rebuilds being
scheduled; a measured I/O rate between R5 and
R4 results in four rebuilds being scheduled while I/O
rates lower than R5 result in five rebuilds being
scheduled. The maximum number of scheduled rebuilds
here is five; any number can be used as the maximum.
In the illustrated embodiment, a minimum size
rebuild is one track. The information represented by
the Fig. 2 chart is maintained for the parity array in the
computer means for effecting scheduled rebuilds.

Fig. 3 illustrates reading data from one of the par- 
ity arrays 13-15, detecting rebuild criteria and priming 
the Fig. 1 illustrated system for rebuilding data. Ausu- 

50 al read operation occurs at machine step 35 as initi- 
ated by other machine operations. At machine step 36 
controller 11 (disk drives may contain error detecting 
facilities as well as the controller or host processors) 
detects errors in the data read from any of the disk 

55 drives 20-23; such errors are attempted to be correct- 
ed in a usual manner. At machine decision step 37, 
controller 11 determines whether or not the error cor- 
rections by the error redundancies in the individual 
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disk drives 20-23 were successful and whether or not 
fault tolerance was degraded even with a successful 
error correction. If error corrections were successful 
(high quality redundancy may still be indicated for 
some purposes and degraded redundancy may be in- 
dicated for other purposes, as will become apparent), 
then, assuming fault tolerant redundancy is not de- 
graded for requiring a rebuild (NO degradation detect- 
ed in machine step 37), machine operations proceed 
to other operations; no rebuild activity is indicated. On 
the other hand, if any one of the disk drives did not 
yield correctable data errors, which include a failure 
to respond, fault tolerance degradation is indicated. 
With the parity disk P 24, such data errors can still be 
corrected by reading the parity redundancy of the 
block from disk drive P 24, then computing the correct 
data from the data successfully read from the other 
drives and the parity redundancy. To achieve this 
known parity correction, the parity block stored in disk 
drive P 24 is read into controller 11. Then the data are 
corrected in machine step 39 using the known parity 
correction procedures. Such correction can occur in 
either a host processor 10 or in controller 11. At this 
point in time, the redundancy for the data unit being 
read has been removed. Then, at machine step 40, 
the parity correcting unit (host processor 10/controller 
11) determines whether or not the parity correction is 
successful. Whenever the parity correction is 
unsuccessful, a subsystem error is flagged in a usual 
manner; recovery procedures beyond the present 
description are then required. Whenever the parity 
correction is successful, then at machine step 41, if it is 
determined that the degradation of the fault tolerance 
effecting redundancy is insufficient to warrant a 
rebuild, other machine operations are performed; if it 
is determined that fault tolerance is unacceptable (a 
disk has failed, for example), then a rebuild is indicated. 
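
The parity correction of machine step 39 can be sketched as follows. This is a minimal illustration only, assuming bytewise XOR parity over equal-length blocks (the description says only "parity"); the names `xor_blocks` and `reconstruct_block` are hypothetical and do not appear in the specification.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings (assumed parity function)."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def reconstruct_block(surviving_data, parity):
    """Recompute the one unreadable data block from the blocks read
    successfully from the other drives plus the parity block."""
    return xor_blocks(list(surviving_data) + [parity])


# Four data blocks (drives 20-23) and their parity block (drive P 24).
data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"]
parity = xor_blocks(data)

# Suppose the block on the third data drive cannot be read.
recovered = reconstruct_block(data[:2] + data[3:], parity)
```

Because XOR is its own inverse, the same routine serves both to generate the parity block and to recover a single missing block; two simultaneous failures in one parity group cannot be recovered this way.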

Figure 4 illustrates the concepts of maintaining 
the desired redundancy without undue interference 
with day-to-day operations. The general arrangement 
of Fig. 4 can be thought of as establishing an interrupt 
driven examination of rebuild needs in a system. 
Machine step 45 represents monitoring and indicating the 
I/O (input/output) rate of operations for each parity group 
13-15 of disk drives. At predetermined times, as will 
become apparent from Fig. 9, from such rate monitoring 
and indicating, a rebuild need is detected at machine 
step 46. Such detection may merely be a global 
rebuild flag or any entry in any of the Fig. 8 illustrated 
bit maps. If a rebuild is needed, then at machine step 
47 a later-described variable rate rebuild is scheduled. 
If a rebuild is not needed, then other machine 
operations are performed. 

Similarly, machine step 50 represents monitoring 
for idle time in any one of the parity groups 
13-15. If idle time is detected, such as when no access 
requests are pending and no free-standing operations 
are being performed, then machine step 51 represents 
detecting a rebuild need. When a rebuild need is detected, 
then at machine step 52 a later-described idle time 
rebuild is effected. If no rebuild is required, other 
machine operations ensue. 

Likewise, machine step 55 represents monitoring 
for a failed access, read operation or write operation 
in any one of the parity groups 13-15 or any access 
to a known failed drive. Upon detecting such an error, 
a rebuild need may be indicated as described for Fig. 
3. Then at machine step 56 the rebuild needs are 
detected. On one hand, if the parity correction described 
in Fig. 3 was successful, a rebuild may be deferred; 
then from machine step 56 other operations ensue. If 
a rebuild is required, then the later-described 
opportunistic rebuild operations of machine step 57 are 
performed. 

It is to be appreciated that the Fig. 4 illustration is 
tutorial; actual practical embodiments may differ in 
substantial details. Interleaving of the rebuild 
strategies can follow several variations. The 
determination of when and the total extent of a rebuild may 
substantially affect a given design. 

Fig. 5 shows one way of selecting between two of 
the three illustrated data rebuild techniques. The 
selection procedure is entered at path 60 from other 
machine operations based upon any one of a plurality of 
criteria, such as a time out, time of day, number of 
accesses, the later-described rebuild schedule of the 
variable rate rebuild, whether or not a bit map of Fig. 8 
indicates any rebuild need, and the like. Such a 
selection could typically reside in a dispatcher or other 
supervisory program (not shown). At machine decision 
or branching step 61, the type of rebuild needed to be 
evaluated is selected. Machine step 61 represents a 
program loop function controlled by software counter 
62. Entry of the procedure at path 60 resets counter 
62 to a reference state; counter 62 enables the 
decision step 61 to first evaluate an idle rebuild at machine 
step 65 as detailed in Fig. 10. If none of the parity 
arrays 13-15 are idle or there is no need for any rebuild 
(bit maps of Fig. 8 are all zeros), then operations 
return to machine step 61; counter 62 is incremented to 
a next value. This next value causes decision step 61 
to effect evaluation of a variable rate rebuild at 
machine step 66 as detailed later in Fig. 9. The rebuild 
scanning may return to Fig. 5 from Fig. 9 to re-execute 
machine step 61 and increment counter 62. Other 
rebuild procedures may be employed (not described) as 
represented by numeral 67. Again, upon completing 
the rebuild evaluation, machine operations returning 
to the Fig. 5 procedure result in another 
incrementation of counter 62 and execution of machine step 61. 
Since the program loop scanning of the procedures 
has been completed, other machine operations are 
performed as indicated by numeral 68. The order of 
scanning the rebuild procedures or methods is 
arbitrary. As shown in Fig. 11, the opportunistic rebuild 
procedure is always entered from a disk accessing 
operation. Any method of scanning rebuild 
procedures may be employed for selecting any one of a 
plurality of rebuild procedures. 
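
The counter-driven scan of Fig. 5 can be sketched as a simple loop. This is an illustrative sketch, not the patented implementation: the function `scan_rebuild_procedures`, the `state` dictionary and the two evaluation callables are hypothetical stand-ins for machine steps 65 and 66.

```python
def scan_rebuild_procedures(procedures):
    """Try each rebuild evaluation once, in list order.

    `procedures` is an ordered list of (name, callable) pairs; each
    callable returns True if it performed or scheduled a rebuild.
    The order is arbitrary, as the description notes.
    """
    results = []
    counter = 0                         # plays the role of software counter 62
    while counter < len(procedures):    # decision step 61
        name, evaluate = procedures[counter]
        results.append((name, evaluate()))
        counter += 1                    # incremented on each return to step 61
    return results                      # scan complete: other operations (68)


# Hypothetical array state: busy, but with tracks flagged for rebuild.
state = {"idle": False, "rebuild_needed": True}


def idle_rebuild():
    """Stand-in for step 65: acts only when the array is idle."""
    return state["idle"] and state["rebuild_needed"]


def variable_rate_rebuild():
    """Stand-in for step 66: may act even while the array is busy."""
    return state["rebuild_needed"]


outcome = scan_rebuild_procedures([
    ("idle", idle_rebuild),
    ("variable_rate", variable_rate_rebuild),
])
```

The opportunistic rebuild is deliberately absent from the list, since the description states it is entered only from a disk accessing operation.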

Fig. 6 is a diagrammatic plan view of a disk in any 
of the disk devices 20-24; a plurality of such disks are 
usually stacked to be coaxially co-rotating in the re- 
spective devices. All tracks on the respective disks 
having the same radial location constitute a cylinder 
of such tracks. When employing a traditional fixed 
block architecture, each disk 70 may be envisioned as 
having a plurality of disk sector-indicating radially-extending 
machine-sensible lines 71. Each disk sector between 
the lines is addressable as a data-storing unit. In a 
count-key-data (CKD) disk a single radially-extending 
track index line is used. Each disk has a multiplicity of 
addressable circular tracks, or circumvolutions of a 
spiral track, residing on each disk 70. A track 72 may 
be affected by error requiring a partial rebuild of the 
array. In disk 70 the data contents of the error affected 
track 72 may be reassigned to track 73; in a rebuild 
the data contents of all tracks 72 in the respective disk 
device 20-24 are similarly reassigned to their respective 
tracks 73. In one mode, the data contents of a cylinder 
of tracks in which track 72 resides may be reassigned 
to a cylinder of tracks including track 73. In another 
mode, only the contents of a single track are reas- 
signed. When a disk device is totally replaced, then 
the data from all of the remaining devices 20-24 are 
used to compute the data for the replaced disk. The 
decision when to replace a disk device that is partially 
operable may be based upon the number of unusable 
or bad tracks on the device, the character of the error 
causing failure, and the like. 

Fig. 7 illustrates a data structure for one imple- 
mentation of the variable rate rebuild method. The ob- 
jective of this method is to maintain during the rebuild 
period at least a minimum level of subsystem perfor- 
mance, i.e. response time to received I/O requests. 
Three registers or data storage locations 80-82, either 
in a host processor 10 or controller 11, store control 
information needed to effect the variable rate rebuild 
method in the respective parity arrays 13-15. Each 
register 80-82 is identically constructed; the criteria 
information may be different to accommodate arrays 
having different performance characteristics or 
system usages. Field rate 83 stores a number indicating 
the rate of rebuild, i.e. one rebuild per second, two per 
second, etc. Field AVG-IO 84 stores the average I/O 
response time, possibly expressed in terms of its 
corresponding I/O request rate, in a predetermined 
measuring period. The I/O response time or request 
rate is used to compute the rebuild rate. Fields 85-88 
respectively store the rebuild rates for the I/O request 
rate thresholds T-1 through T-4, no limitation 
thereto intended. The total number of disk 
accesses to a parity array is an indication of response; 
the greater the number of access requests, the lower 
the desired rebuild rate. Thresholds T-1 through T-4 
correspond to decreasing access request rates 
and indicate higher and higher rebuild 
rates. Threshold T-1 indicates an access rate above 
which no rebuild is permitted 
by the variable rate rebuild method. Threshold T-2 
indicates an access rate greater than T-3 and smaller 
than T-1 and which permits one rebuild access (i.e. 
rebuilds one track of data) during a constant request 
rate measuring period. Similarly, threshold T-3 
indicates an access rate greater than T-4 and smaller 
than T-2 and which permits two rebuild accesses 
during the constant rebuild rate measuring period. As 
request rates continue to decrease, corresponding 
increases in rebuild rates occur. A predetermined 
maximum rebuild rate for the system may be established. 
In another implementation of the variable rate rebuild 
method, an average response time can be directly 
measured during each successive measuring period. 
If the measured response time is slower than a 
desired response time, the rebuild rate to be used during 
the next successive measuring period is reduced. If 
the measured response time is shorter than the 
desired response time, the rebuild rate used in the next 
successive measuring period is increased. 
Alternately, if I/O access queues exist, then the rebuild rate may 
be selected to be inversely proportional to the length 
of the access queues for the respective parity arrays. 
Any of the above-described measurement techniques 
may be employed for establishing the rebuild rate 
control information stored in fields 83 and 84. 
Numeral 89 indicates that additional criteria may be used for 
determining a rebuild rate which is inversely 
proportional to accessing/free-standing array operations. 
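
The threshold scheme of Fig. 7 can be sketched as a table lookup, together with the alternative response-time feedback. This is a sketch under stated assumptions: the numeric thresholds, the maximum rate of 4, the treatment of a rate exactly equal to a threshold, and the adjustment step of one rebuild per period are all invented for illustration; only the ordering T-1 > T-2 > T-3 > T-4 and the rates 0, 1, 2 for the first three bands come from the description.

```python
def select_rebuild_rate(avg_io_rate, thresholds, max_rate):
    """Return the number of rebuild accesses permitted per measuring period.

    `thresholds` holds (access_rate_threshold, rebuild_rate) pairs in
    descending threshold order (T-1 first), mirroring fields 85-88.
    The entry whose threshold is the closest one below the measured
    rate applies; below every threshold the maximum rate applies.
    """
    for threshold, rebuild_rate in thresholds:
        if avg_io_rate > threshold:
            return rebuild_rate
    return max_rate


def adjust_rate_for_response_time(current_rate, measured_ms, desired_ms, max_rate):
    """Alternative scheme: lower the rate when responses are too slow,
    raise it when they are faster than required (step size assumed to be 1)."""
    if measured_ms > desired_ms:
        return max(0, current_rate - 1)
    if measured_ms < desired_ms:
        return min(max_rate, current_rate + 1)
    return current_rate


# Hypothetical values for T-1 .. T-4 in accesses per measuring period.
thresholds = [(400, 0), (300, 1), (200, 2), (100, 3)]

busy_rate = select_rebuild_rate(450, thresholds, max_rate=4)    # above T-1
quiet_rate = select_rebuild_rate(50, thresholds, max_rate=4)    # below T-4
```

Either function yields a rebuild rate that falls as array activity rises, which is the inverse relationship the variable rate rebuild method relies on.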

Information specifying which tracks need rebuilding 
is maintained in bit maps 95-97 (Fig. 8) respectively 
for parity arrays 13-15. The rows 105, 106, 107 ... 
of bit-containing squares 99 respectively indicate sets 
of logical tracks on one recording surface of disk 70 
(Fig. 6). The columns 100, 101, 102 ... of bit-containing 
squares 99 respectively represent logical cylinders 
of the logical tracks. Each logical track includes 
one physical track in each device 20-24 and each 
logical cylinder includes one physical cylinder in each 
device 20-24. When any one of the parity groups 13-15 
is providing complete redundancy, then all of the 
squares 99 in the respective bit map 95-97 contain 
binary 0's. Any track needing a rebuild, whether as part 
of a complete rebuild of a disk device or a partial 
rebuild, is indicated by a binary 1 in the square or bit 
position 99 of the respective bit map. Scanning the bit 
maps for ascertaining rebuild needs follows known 
techniques. An index or written-to-bit-map value (not 
shown) may be used to indicate that a respective bit 
map either contains at least a binary 1, all binary 0's 
or the number of binary 1's in each respective bit map. 
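
A bit map of the kind shown in Fig. 8 can be sketched as a small class. The class name `RebuildBitmap`, its method names and the use of a running count as the optional index value are assumptions made for illustration; the specification fixes only the row/column meaning and the 0/1 convention.

```python
class RebuildBitmap:
    """One bit per logical track of a parity array; 1 means 'needs rebuild'.

    Rows correspond to recording surfaces (sets of logical tracks),
    columns to logical cylinders, as in Fig. 8.
    """

    def __init__(self, cylinders, tracks_per_cylinder):
        self.cylinders = cylinders
        self.tracks_per_cylinder = tracks_per_cylinder
        self.bits = [[0] * cylinders for _ in range(tracks_per_cylinder)]
        self.pending = 0    # number of binary 1's (the optional index value)

    def mark(self, cylinder, track):
        """Flag one logical track for rebuild (idempotent)."""
        if not self.bits[track][cylinder]:
            self.bits[track][cylinder] = 1
            self.pending += 1

    def clear(self, cylinder, track):
        """Reset the bit once the track has been rebuilt."""
        if self.bits[track][cylinder]:
            self.bits[track][cylinder] = 0
            self.pending -= 1

    def mark_whole_device(self):
        """A complete device rebuild flags every logical track."""
        for track in range(self.tracks_per_cylinder):
            for cylinder in range(self.cylinders):
                self.mark(cylinder, track)

    def needs_rebuild(self):
        return self.pending > 0

    def tracks_needing_rebuild(self, cylinder):
        return [t for t in range(self.tracks_per_cylinder)
                if self.bits[t][cylinder]]


bitmap = RebuildBitmap(cylinders=4, tracks_per_cylinder=2)
bitmap.mark(cylinder=1, track=0)
bitmap.mark(cylinder=3, track=1)
```

Keeping the count alongside the bits lets a scheduler test for an all-zero map without scanning it.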

Fig. 9 illustrates one implementation of the variable 
rate rebuild method, including generating the 
control information for the Fig. 7 data structure. 

EP 0 519 670 A2 

The description is for one of the parity arrays; modifying 
the machine operations to accommodate a plurality of 
parity arrays can be achieved by any one of several 
approaches. For example, only one of the three arrays 
13-15 may have a non-zero bit map; then only the 
array indicated by the non-zero bit map is processed. 
If a plurality of bit maps are non-zero, the order of 
rebuilding in the three arrays may be based upon the 
least busy array, a relative importance of the arrays 
to [...] operations, and the like. In any event, entry of 
the Fig. 9 illustrated machine operations is over path 
110 from either Fig. 4 or 5; the present description 
assumes entry from Fig. 5. At machine step 111, whether 
or not a measurement period time out has occurred is 
sensed. If a measurement period did not time out, then 
at machine step 112 the access tally is updated; other 
rate indicating criteria may be updated as well. 
Following the update, field 83 is examined along with an 
elapsed time indication. Elapsed time is computed from 
the time of day of the last rebuild for the parity 
array [...], such as indicated by numeral 89 of Fig. 7. 
Such last rebuild may be by any of the methods being 
used: the last rebuild achieved by the variable rate 
rebuild method or the last rebuild by [...]. If the time 
for a rebuild has not been reached, then the Fig. 5 
scanning procedure is reentered. If a rebuild is 
scheduled, then at machine step 118 the cylinder and 
track(s) to be rebuilt are selected. 

The cylinder and track selection is designed to 
minimize seek times to reach the track(s) to be rebuilt. 
In a synchronized mode array, since the track access 
mechanisms [...], the cylinder having a closest radial 
proximity to the currently scanned tracks/cylinders 
[...] is selected. [...] The track/cylinder to be rebuilt 
is the cylinder located at a mean radial position 
between a first device in the array having its access 
mechanism at a radially inwardmost position and a 
second device having its access mechanism at a radially 
outwardmost position of all devices in the array. For 
example, if the first device has its access mechanism 
scanning track 72 of Fig. 6, the second device has its 
access mechanism scanning track 73, and the other 
operational devices in the parity array have their 
respective access mechanisms scanning tracks radially 
between tracks 72 and 73, then the cylinder radially 
midway between tracks 72 and 73 is examined for 
rebuild. If the midway cylinder has no track to rebuild, 
then cylinders are successively examined at 
successively increasing radial distances from the midway 
cylinder. This determination follows by examining the 
cylinder which is next radially outward of the midway 
cylinder, thence [...], then the next radially inward 
cylinder, etc. 

After selecting the cylinder and a track(s) in the 
selected cylinder, the rebuild is effected at machine 
step 119. This rebuild includes reading the data from 
the corresponding physical tracks in the other devices 
of the array, computing the data to be rebuilt, then 
storing the computed data into the selected track(s). 
Upon rebuilding, the respective bit position(s) of the 
array bit map of Fig. 8 is reset to zero. At machine 
step 120 a check is made on whether or not the parity 
array is idle and there are tracks remaining to be 
rebuilt. Whenever the parity array is idle and tracks 
remain to be rebuilt, rebuilding continues; otherwise 
other machine operations are performed. 

Calculation of the variable rebuild rate occurs 
whenever machine step 111 detects the end of a 
measurement period. In computing the desired rebuild 
rate, at machine step 125 the number of accesses to 
the parity array are averaged to obtain an average 
access rate; such averaging allows varying the 
measurement period. The average is stored in field 84. 
At machine step 126, such access tally may be stored 
in a register 80-82 as [...]. At machine step 127 the 
rebuild rate is determined by comparing the threshold 
fields 85-88 to the field 84 value. The rebuild rate 
corresponding to the threshold field 85-88 having a 
value least less than the field 84 value is stored in 
field 83 as the new rebuild rate. Remember that 
threshold T-1 permits no rebuild. If a queue length 
[...], the rebuild rate corresponding [...]. 

Fig. 10 illustrates the idle rebuild method. Entry into 
the method is over path 129 from Fig. 5. Whether the 
parity array is idle and a track in the parity array 
needs a rebuild is checked at machine step 130. If the 
parity array is not idle or there is no need for a 
rebuild, then the operation returns over path 135 to 
the caller, such as the Fig. 5 illustrated selection 
method. When the parity array is idle with a rebuild need, 
then at machine step 131 a cylinder and one of its 
tracks are selected for rebuild. This selection uses the 
aforedescribed selection method. Following selection 
and location of the selected track, machine step 132 
rebuilds the track contents. Machine step 133 then 
checks to see if the parity array is still idle; if yes, steps 
131 and 132 are repeated until either no more rebuilds 
are needed or the parity array becomes busy. At that 
point other machine operations ensue. 
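
The midway-cylinder selection described for Fig. 9 and the idle loop of Fig. 10 can be sketched together. This is a simplified sketch under several assumptions not taken from the specification: cylinder numbers are assumed to increase radially outward, the search alternates outward then inward at each distance, rebuild needs are tracked per cylinder rather than per track, and all function names are hypothetical.

```python
def candidate_cylinders(midway, cylinder_count):
    """Yield the midway cylinder, then cylinders at successively
    increasing radial distance, outward before inward."""
    yield midway
    for distance in range(1, cylinder_count):
        for cylinder in (midway + distance, midway - distance):
            if 0 <= cylinder < cylinder_count:
                yield cylinder


def select_cylinder(head_positions, needs_rebuild, cylinder_count):
    """Pick the flagged cylinder nearest the point midway between the
    innermost and outermost access mechanisms (the step 118 idea).

    head_positions: current cylinder of each operational device.
    needs_rebuild:  set of cylinders holding at least one flagged track.
    """
    midway = (min(head_positions) + max(head_positions)) // 2
    for cylinder in candidate_cylinders(midway, cylinder_count):
        if cylinder in needs_rebuild:
            return cylinder
    return None


def idle_rebuild(is_idle, head_positions, needs_rebuild, cylinder_count, rebuild):
    """Fig. 10 style loop: rebuild while the array stays idle and work remains."""
    rebuilt = []
    while is_idle() and needs_rebuild:            # steps 130 and 133
        cylinder = select_cylinder(head_positions(), needs_rebuild,
                                   cylinder_count)    # step 131
        if cylinder is None:
            break
        rebuild(cylinder)                         # step 132
        needs_rebuild.discard(cylinder)
        rebuilt.append(cylinder)
    return rebuilt


# Heads at cylinders 10..40 give a midway cylinder of 25.
first_choice = select_cylinder([10, 40, 22, 31], {23, 28, 60}, 100)

# The array stays idle for two checks, then becomes busy.
idle_answers = iter([True, True, False])
pending = {23, 28, 60}
done = idle_rebuild(lambda: next(idle_answers), lambda: [10, 40, 22, 31],
                    pending, 100, rebuild=lambda cylinder: None)
```

Starting the search at the midway cylinder keeps the worst-case seek for any one device short, which matches the stated aim of limiting seek time for the rebuild.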

Fig. 11 illustrates the opportunistic rebuild method. 
A track access operation is initiated over path 139. 
For purposes of illustration, the ensuing machine step 
140 is an attempted read from the track to be 
accessed. Any access operation may be used, such as 
write, format write, verify, erase and the like. 
Assuming a read operation, machine step 141 determines 
whether or not a hard (uncorrectable) error has 
occurred. Included in the machine step 141 operation is 
detection that the device containing the accessed 
track is already known to have failed and a rebuild is 
pending or already in progress for that device. If the 
read produced no hard error, i.e. no errors detected 
or a corrected error occurred, machine step 142 
checks the quality of the read-back operation. Since 
a corrected error may not be repeated, machine step 
142 may not invoke the opportunistic rebuild method, 
choosing to proceed as OK over path 143 to other 
operations. If the quality requirements are not met, such 
as determinable by evaluating read signal quality, the 
type and extent of the corrected error, systems 
requirements (desired quality of the redundancy in the 
array) and the like, the opportunistic rebuild is 
initiated (the NO exit from step 142). Machine step 144 
effects a rebuild of the currently accessed track from 
either machine step 141 detecting a hard error or from 
machine step 142. Rebuilding follows the 
above-described method of rebuild. Upon completing 
rebuilding the accessed track data in machine step 144, 
in machine step 145 a check is made on whether or 
not the parity array is idle. If the parity array is not idle, 
then other operations are performed; if the parity 
array is idle, then machine step 146 rebuilds a next 
track. Such a rebuild includes selecting a cylinder and 
track followed by the actual rebuild method. Machine 
steps 145 and 146 repeat until either no more rebuilds 
are needed (bit maps are all zeros) or the parity array 
becomes active. 
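
The control flow of Fig. 11 can be sketched as follows. This is an illustrative sketch only: the `ReadResult` enumeration, the function `opportunistic_access` and all of its callback parameters are hypothetical names, and the quality check of step 142 is reduced to a caller-supplied predicate.

```python
from enum import Enum


class ReadResult(Enum):
    CLEAN = "clean"            # no error detected
    CORRECTED = "corrected"    # error corrected by the drive's own redundancy
    HARD_ERROR = "hard_error"  # uncorrectable error


def opportunistic_access(track, read, quality_ok, known_failed,
                         rebuild_track, is_idle, next_pending_track):
    """Access `track`; return the list of tracks rebuilt as a side effect."""
    rebuilt = []
    if known_failed(track):                      # part of step 141
        result = ReadResult.HARD_ERROR
    else:
        result = read(track)                     # step 140
    hard = result is ReadResult.HARD_ERROR       # step 141
    if not hard and quality_ok(track, result):   # step 142, OK exit (path 143)
        return rebuilt
    rebuild_track(track)                         # step 144
    rebuilt.append(track)
    while is_idle():                             # step 145
        following = next_pending_track()         # step 146
        if following is None:                    # bit maps are all zeros
            break
        rebuild_track(following)
        rebuilt.append(following)
    return rebuilt


# Hypothetical scenario: a hard error on track 3, two other tracks pending,
# and an array that stays idle for two checks before becoming active.
pending = [7, 9]
rebuilt_log = []
idle_answers = iter([True, True, False])

rebuilt = opportunistic_access(
    track=3,
    read=lambda t: ReadResult.HARD_ERROR,
    quality_ok=lambda t, r: True,
    known_failed=lambda t: False,
    rebuild_track=rebuilt_log.append,
    is_idle=lambda: next(idle_answers),
    next_pending_track=lambda: pending.pop(0) if pending else None,
)
```

The accessed track is always rebuilt first, since its access mechanism is already positioned there; further rebuilds proceed only while the array remains idle.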



Claims 

1. A method of automatically maintaining fault toler- 
ance in a parity array (13) consisting of a plurality 
of disk drives (20, 21, 22, 23, 24), comprising the 
machine-executed steps of: 

detecting a degradation in the fault toler- 
ance of the parity array; 

evaluating the current information handling 
activity of the parity array; 

establishing a plurality of data rebuild 
methods for removing the fault tolerance degra- 
dation from the parity array; and 

analyzing the current information handling 
activity of the parity array and selecting one of the 
plurality of rebuild methods to ensure that the per- 
formance of said current information handling ac- 
tivity is not degraded below a predetermined lev- 
el. 

2. A method as claimed in claim 1, wherein one of 
said plurality of data rebuild methods comprises 
the machine-executed steps of: 

measuring the rate of machine operations 
of the array; 

establishing the rate of rebuilding data af- 
fected by the error in inverse proportion to the 
measured rate of machine operations; and 

rebuilding the data accordingly. 

3. A method as claimed in claim 1 or 2, wherein one 
of said plurality of data rebuild methods compris- 
es the machine-executed steps of: 

detecting that the parity array is idle; 

detecting that one of the disk drives is 
affected by error; 

rebuilding the data during said detected 
idle times. 

4. A method as claimed in claim 3 as dependent on 
claim 2, further comprising the machine-execut- 
ed steps of: 

completing a data rebuild using the rebuild 
method of claim 2; 

detecting that the parity array is idle; and 

continuing the data rebuilding of additional 
addressable error-affected data so long as the 
parity array is idle. 

5. A method as claimed in any of claims 1 to 4, where 
one of said plurality of data rebuild methods com- 
prises the machine-executed steps of: 

performing a data area access operation 
to an addressable data unit in the parity array; 

while performing the data area access op- 
eration, detecting that the addressed data unit is 
affected by error and in need of a data rebuild; 

rebuilding the addressable data unit being 
accessed. 

6. Apparatus for automatically maintaining fault tol- 
erance in a parity array (13) consisting of a plur- 
ality of disk drives (20, 21, 22, 23, 24), the appa- 
ratus comprising: 

means for detecting a degradation in fault 
tolerance in the parity array such that one of the 
disk drives needs to have data rebuilt; 

means for measuring a rate of machine operations 
of said array; 

means responsive to said measured operations 
rate for establishing a rate of rebuilding for 
the array for recovering from said degradation; 
and 

a plurality of data rebuild means for per- 
forming data rebuild in said one disk drive. 

7. Apparatus as claimed in claim 6, further comprising: 

control means (11) for controlling access 
to the disk drives in the parity array and for detecting 
when the array is currently not being accessed 
for a data handling operation. 

8. Apparatus as claimed in claim 7, wherein one of 
said plurality of data rebuild means is connected 
to said control means and to said rate establishing 
means for determining whether a rebuild can 
be scheduled, and if so, activating the control 
means to give access to the parity array to perform 
a series of rebuild operations at the established 
rate. 

9. Apparatus as claimed in claim 8, wherein said 
rate establishing means further comprises: 

means for indicating a plurality of rebuild 
rates, each corresponding to predetermined 
ranges of machine operations rates and in inverse 
proportion to the operations rate; and 

means for indicating one of said plurality of 
rebuild rates as said established rate corresponding 
to the current machine operations rate. 

10. Apparatus as claimed in claim 7, 8 or 9, wherein 
one of said plurality of data rebuild means is con- 
nected to said control means and responsive to 
the parity array not being currently accessed to 
activate the control means to give access to the 
parity array to perform a data rebuild. 

11. Apparatus as claimed in any of claims 7 to 10, 
wherein one of said plurality of data rebuild 
means is connected to the control means and responsive 
to a disk drive access occurring in an 
area of the array that needs a data rebuild to per- 
form a data rebuild. 



12. Apparatus as claimed in any of claims 7 to 11, 
wherein each of the data rebuild means upon 
completing a data rebuild operation is responsive 
to the control means indicating no current access 
or no pending access requests to initiate another 
data rebuild operation. 